CIST SUMMER SCHOOL 2022

Preparatory data inventory

Claude Grasland (Université de Paris (Diderot), UMR 8504 Géographie-cités, FR 2007 CIST)

1 GEOMEDIA DATA

1.1 Data preparation

1.1.1 Importing the CSV file

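library(knitr)  # provides kable()

# Build the path to the Media Cloud export
store <- "data/mediacloud"
media <- "fr_BEN_tribun"
type <- ".csv"
fic <- paste0(store, "/", media, type)

# Read the semicolon-separated file as UTF-8, keeping strings as characters
df <- read.csv(fic,
               sep = ";",
               header = TRUE,
               encoding = "UTF-8",
               stringsAsFactors = FALSE)

# Remove news with a duplicated title (the first occurrence is kept)
df <- df[!duplicated(df$title), ]

kable(head(df))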
| stories_id | publish_date | title | url | language | ap_syndicated | themes | media_id | media_name | media_url |
|---|---|---|---|---|---|---|---|---|---|
| 1128961138 | 2019-01-01 09:55:07 | Bénin : les vœux de Patrice Talon aux béninois | https://lanouvelletribune.info/2019/01/benin-les-voeux-de-patrice-talon-aux-beninois/ | fr | False | | 39230 | lanouvelletribune | http://www.lanouvelletribune.info/ |
| 1129273235 | 2019-01-02 05:31:36 | RDC : l’ultimatum des USA et de l’UE à Joseph Kabila | https://lanouvelletribune.info/2019/01/rdc-lultimatum-des-usa-et-de-lue-a-joseph-kabila/ | fr | False | | 39230 | lanouvelletribune | http://www.lanouvelletribune.info/ |
| 1129306754 | 2019-01-02 05:48:16 | Ali Bongo paralysé? la rumeur lancée par un site panafricain | https://lanouvelletribune.info/2019/01/ali-bongo-paralyse-la-rumeur-lancee-par-un-site-panafricain/ | fr | False | | 39230 | lanouvelletribune | http://www.lanouvelletribune.info/ |
| 1129450307 | 2019-01-02 06:41:08 | Bénin : Le président de la Criet craint que le procès ICC Services ne se consume dans la flamme du mensonge | https://lanouvelletribune.info/2019/01/benin-le-president-de-la-criet-craint-que-le-proces-icc-services-ne-se-consume-dans-la-flamme-du-mensonge/ | fr | False | | 39230 | lanouvelletribune | http://www.lanouvelletribune.info/ |
| 1129450279 | 2019-01-02 07:50:03 | Donald Trump : Il ne s’est pas montré à la hauteur du bureau Ovale, selon Mitt Romney | https://lanouvelletribune.info/2019/01/donald-trump-il-ne-sest-pas-montre-a-la-hauteur-du-bureau-ovale-selon-mitt-romney/ | fr | False | | 39230 | lanouvelletribune | http://www.lanouvelletribune.info/ |
| 1129450236 | 2019-01-02 09:13:51 | Emmanuel Macron: Il a échangé à deux reprises ” de manière laconique” avec Benalla | https://lanouvelletribune.info/2019/01/emmanuel-macron-il-a-echange-a-deux-reprises-de-maniere-laconique-avec-benalla/ | fr | False | | 39230 | lanouvelletribune | http://www.lanouvelletribune.info/ |

1.1.2 Fixing encoding problems

When encoding problems are few, as in the present example, they can sometimes be fixed manually.
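A minimal sketch of such a manual repair, assuming the problem is a handful of UTF-8 characters that were read with a Windows-1252 encoding. The titles and the replacement table below are illustrative and are not taken from the corpus; the script itself is assumed to be saved as UTF-8.

```r
# Illustrative titles with typical garbled sequences (assumption, not corpus data)
titles <- c("BÃ©nin : les vÅ“ux de Patrice Talon", "Ali Bongo paralysÃ©?")

# Hypothetical replacement table: garbled sequence -> intended character
fixes <- c("Ã©" = "é", "Ã¨" = "è", "Å“" = "œ")

# Apply each replacement literally (fixed = TRUE disables regular expressions)
for (bad in names(fixes)) {
  titles <- gsub(bad, fixes[[bad]], titles, fixed = TRUE)
}
titles
```

This approach only scales to a short list of known sequences; a file that is garbled throughout is better re-read with the right encoding argument.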

1.1.3 Conversion to quanteda format

We store the corpus in quanteda format by simply converting the data frame imported above. We keep only two document variables: the name of the source and the date of publication.

library(quanteda)

# Create the quanteda corpus: the document text is the news title
qd <- corpus(df, docid_field = "stories_id", text_field = "title")

# Keep only two docvars: the source (who) and the publication date (when)
qd$when <- as.Date(qd$publish_date)
qd$who <- media
docvars(qd) <- docvars(qd)[, c("who", "when")]

# Add corpus-level metadata
meta(qd, "meta_source") <- "Media Cloud "
meta(qd, "meta_time") <- "Download the 2021-09-30"
meta(qd, "meta_author") <- "Elaborated by Claude Grasland"
meta(qd, "project") <- "ANR-DFG Project IMAGEUN"

# Save the corpus as an RDS file next to the CSV file
store <- "data/mediacloud"
type <- ".RDS"
myfile <- paste0(store, "/", media, type)
myfile
[1] "data/mediacloud/fr_BEN_tribun.RDS"
saveRDS(qd, myfile)

# Show the first three documents
qd[1:3]
Corpus consisting of 3 documents and 2 docvars.
  1128961138 :
  "Bénin : les vœux de Patrice Talon aux béninois"

  1129273235 :
  "RDC : l'ultimatum des USA et de l'UE à Joseph Kabila"

  1129306754 :
  "Ali Bongo paralysé? la rumeur lancée par un site panafricain"
summary(qd,3)
Corpus consisting of 20639 documents, showing 3 documents:

         Text Types Tokens Sentences           who       when
   1128961138     9      9         1 fr_BEN_tribun 2019-01-01
   1129273235    11     11         1 fr_BEN_tribun 2019-01-02
   1129306754    11     11         2 fr_BEN_tribun 2019-01-02

1.1.4 Converting back to a tibble

In the following steps we make intensive use of quanteda, but it is sometimes useful to export the results to a more convenient format or to use other packages. For this reason, it is important to know that the tidytext package can easily convert a quanteda object into a tibble, which is more familiar, easier to manage, and easy to export to other formats such as data.frame or data.table.

| text | who | when |
|---|---|---|
| Bénin : les vœux de Patrice Talon aux béninois | fr_BEN_tribun | 2019-01-01 |
| RDC : l’ultimatum des USA et de l’UE à Joseph Kabila | fr_BEN_tribun | 2019-01-02 |
| Ali Bongo paralysé? la rumeur lancée par un site panafricain | fr_BEN_tribun | 2019-01-02 |
| Bénin : Le président de la Criet craint que le procès ICC Services ne se consume dans la flamme du mensonge | fr_BEN_tribun | 2019-01-02 |
| Donald Trump : Il ne s’est pas montré à la hauteur du bureau Ovale, selon Mitt Romney | fr_BEN_tribun | 2019-01-02 |
| Emmanuel Macron: Il a échangé à deux reprises de manière laconique avec Benalla | fr_BEN_tribun | 2019-01-02 |
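A table like the one above can be obtained with the tidy() method that tidytext provides for quanteda corpora. The sketch below is only an illustration: it builds a small stand-in corpus (toy) rather than using the full corpus, and the object names are assumptions.

```r
library(quanteda)
library(tidytext)

# Small stand-in corpus with the same docvars as above (who, when)
toy <- corpus(
  c(d1 = "Bénin : les vœux de Patrice Talon aux béninois",
    d2 = "RDC : l'ultimatum des USA et de l'UE à Joseph Kabila"),
  docvars = data.frame(who = "fr_BEN_tribun",
                       when = as.Date(c("2019-01-01", "2019-01-02")))
)

# tidy() returns a tibble with one row per document:
# a text column followed by the docvars
td <- tidy(toy)
td
```

The resulting tibble can then be handled with dplyr or converted with as.data.frame().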

1.2 Geographical tags

The aim of this section is to add to the quanteda corpus metadata on the geographical entities mentioned in the news. We do not discuss here the problems related to the choice of a list of entities; we simply apply a dictionary-based recognition method. It is theoretically possible to recognize many kinds of spatial entities (regions, continents, cities, …), but we limit our search here to the states recognized by the UN, plus some partly recognized territories such as Kosovo, Northern Cyprus or Taiwan.

1.2.1 Load dictionary

We start by loading the latest version of the Imageun dictionary and extract our target language (here, French).

   ISO3           x lang
1:  ABK     abkhaz*   fr
2:  AFG     afghan*   fr
3:  AFG      kaboul   fr
4:  AFG afghanistan   fr
5:  AGO      angola   fr
6:  AGO   angolais*   fr

1.2.2 Load corpus

1.2.3 Load tagging function

1.2.4 Annotate all entities

In a first step, we annotate all geographic entities together in order to benefit from the cross-definition of their respective compounds. We will separate them by subcategories in a second step.

[1] "Program executed in  1.73629403114319"

      0     1     2     3     4     5 
  13089  6430   992   112    15     1 
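The tagging function itself is not shown here. As a rough illustration of dictionary-based recognition, the sketch below matches single-word patterns against the words of each title and counts the states found. The helper tag_states(), the toy dictionary and the titles are assumptions; multi-word entities (e.g. "Corée du Nord") and the cross-definition of compounds mentioned above are deliberately left out.

```r
# Toy dictionary in the same layout as the Imageun extract (ISO3, pattern, language)
dict <- data.frame(
  ISO3 = c("AFG", "AFG", "AGO", "AGO"),
  x    = c("afghan*", "kaboul", "angola", "angolais*"),
  lang = "fr",
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the ISO3 codes found in one title
tag_states <- function(title, dict) {
  # Lower-case the title and split it into words
  words <- strsplit(tolower(title), "[^[:alpha:]]+")[[1]]
  # glob2rx() turns "afghan*" into the regular expression "^afghan"
  hit <- vapply(dict$x,
                function(p) any(grepl(utils::glob2rx(p), words)),
                logical(1))
  unique(dict$ISO3[hit])
}

titles <- c("Les afghans quittent Kaboul",
            "Angola : visite officielle",
            "Le programme du jour")
states <- lapply(titles, tag_states, dict = dict)

# Number of distinct states per title, then the frequency table
nbstates <- lengths(states)
table(nbstates)
```

The last line produces the same kind of frequency table as the one shown above, here for three toy titles only.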

1.2.5 Check the news with the largest number of states


      0     1     2     3     4     5 
  13089  6430   992   112    15     1 
| stories_id | who | when | text | states | nbstates |
|---|---|---|---|---|---|
| 1176232825 | fr_BEN_tribun | 2019-01-27 | Venezuela : la Russie accuse les USA de vouloir agir comme en Irak et en Libye | VEN RUS USA IRQ LBY | 5 |
| 1183466572 | fr_BEN_tribun | 2019-02-05 | Venezuela : L’Italie, la Grèce, la Belgique et l’Irlande disent niet à Macron et Guaido | VEN ITA BEL IRL | 4 |
| 1191253445 | fr_BEN_tribun | 2019-02-11 | UE : la Chine, la Russie, les USA et le Maroc accusés d’espionnage | CHN RUS USA MAR | 4 |
| 1338873613 | fr_BEN_tribun | 2019-07-15 | Iran : la France, l’Allemagne et le Royaume-Uni lancent un appel à Trump et Rohani | IRN FRA DEU GBR | 4 |
| 1391394085 | fr_BEN_tribun | 2019-09-13 | Pétrole iranien à la Syrie : après les anglais, les USA en colère | IRN SYR GBR USA | 4 |
| 1391769004 | fr_BEN_tribun | 2019-09-13 | Annonces de l’Iran : la France, l’Allemagne et le Royaume-Uni préoccupés | IRN FRA DEU GBR | 4 |
| 1417745234 | fr_BEN_tribun | 2019-10-13 | Syrie : Après l’offensive turque, Trump menace la France et l’Allemagne | SYR TUR FRA DEU | 4 |
| 1475893708 | fr_BEN_tribun | 2019-12-20 | En pleine tension avec les USA, l’Iran va s’entraîner avec la Russie et la Chine | USA IRN RUS CHN | 4 |
| 1648884614 | fr_BEN_tribun | 2020-06-30 | Libye : Ankara accuse la France de chercher à renforcer la présence russe | LBY TUR FRA RUS | 4 |
| 1689050167 | fr_BEN_tribun | 2020-08-21 | Sanctions contre l’Iran : Paris, Berlin et Londres osent s’opposer à Donald Trump | IRN FRA DEU GBR | 4 |
| 1733113028 | fr_BEN_tribun | 2020-10-08 | Iran, Turquie : l’erreur stratégique des palestiniens selon un saoudien | IRN TUR PSE SAU | 4 |
| 1773667257 | fr_BEN_tribun | 2020-11-19 | Cybermenaces : le Canada se dit visé par la Chine, la Russie, l’Iran et la Corée du Nord | CAN CHN RUS IRN | 4 |
| 1782371772 | fr_BEN_tribun | 2020-11-28 | Nucléaire iranien : les USA sanctionnent des sociétés chinoises et russes | IRN USA CHN RUS | 4 |
| 1837356392 | fr_BEN_tribun | 2021-01-29 | En Libye, les USA veulent reprendre la main face aux turcs et aux russes | LBY USA TUR RUS | 4 |
| 1850495176 | fr_BEN_tribun | 2021-02-12 | Nucléaire iranien : l’avertissement de Paris, Londres et Berlin aux iraniens | IRN FRA GBR DEU | 4 |
| 1875341190 | fr_BEN_tribun | 2021-03-10 | L’Allemagne veut s’allier aux USA pour contrer la Russie et la Chine | DEU USA RUS CHN | 4 |

1.3 Topic tags

1.3.1 Dictionary

All text is converted to lower case before matching. A trailing star marks the words that can take a plural form.

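| code | lang | label |
|---|---|---|
| pand | fr | épidémie* |
| pand | fr | pandémie* |
| pand | fr | virus |
| pand | fr | oms |
| pand | fr | ébola |
| pand | fr | ebola |
| pand | fr | h1n1 |
| pand | fr | sras |
| pand | fr | chikungunya |
| pand | fr | choléra |
| pand | fr | peste |
| pand | fr | covid* |
| pand | fr | coronavir* |
| pand | fr | ncov* |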
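One possible way to apply such a dictionary is quanteda's own lookup, which supports glob patterns with a star. The sketch below is an illustration under assumptions: it uses only a few of the patterns listed above, two made-up titles, and object names (dict_pand, toks, counts) that do not come from the original program.

```r
library(quanteda)

# Topic dictionary: one key (pand) and a few of the patterns listed above
dict_pand <- dictionary(list(pand = c("épidémie*", "pandémie*", "virus", "covid*")))

# Two illustrative titles (assumption, not corpus data)
titles <- c(n1 = "Covid-19 : la pandémie progresse",
            n2 = "Bénin : les vœux de Patrice Talon")

# Tokenize, lower-case the tokens, then look up the glob patterns
toks <- tokens_tolower(tokens(titles, remove_punct = TRUE))
hits <- dfm(tokens_lookup(toks, dict_pand, valuetype = "glob"))

# Number of matches per title for the topic
counts <- as.matrix(hits)[, "pand"]
counts
```

A title can then be tagged with the topic whenever its count is greater than zero.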

1.3.2 Annotation

1.3.3 Visualization

1.3.4 Save the thematically annotated corpus

| who | when | where1 | where2 | what | tags | news |
|---|---|---|---|---|---|---|
| fr_BEN_tribun | 2019-01-01 | no | no | NA | 1 | 1.00 |
| fr_BEN_tribun | 2019-01-02 | COD | COD | NA | 2 | 1.25 |
| fr_BEN_tribun | 2019-01-02 | COD | USA | NA | 1 | 0.25 |
| fr_BEN_tribun | 2019-01-02 | USA | COD | NA | 1 | 0.25 |
| fr_BEN_tribun | 2019-01-02 | USA | USA | NA | 1 | 0.25 |
| fr_BEN_tribun | 2019-01-02 | no | no | NA | 10 | 10.00 |

2 Hypercube exploration

2.1 Objectives

The dimensions of a hypercube can be analysed by aggregating them in different ways, which yields different tables and allows different modes of visualization. Each function is named after the dimensions that it combines and produces two outputs: a statistical table and an interactive graphic.

2.1.1 Statistical table

Whatever the dimensions we decide to cross, we build a table and run a statistical test to identify the cells that are positive or negative outliers, i.e. cells where the phenomenon of interest (WHAT) is significantly more or less present than usual. More precisely, the function produces two indices for each cell of the cross-dimension table:

  • a salience index (Xobs/Xest) : defined as the ratio between observed and estimated number of news where the topic is present.
  • an outlier index (prob (Xobs > Xest)) : defined as the probability that the number of news where the topic is present is significantly greater than expected.

In both cases we introduce two control parameters that restrict the computation of the indices to the cells where the measure is statistically relevant:

  • Minimum sample size (minsamp): the total number of news that must be present in the cell before the probability of occurrence of the topic is computed. The default value is 20, as we consider it not meaningful to compute a proportion on a smaller sample.

  • Minimum estimated value (mintest): the threshold on the estimated number of news where the topic is present, below which the chi-square test is not computed. According to the usual conditions of application of the chi-square test, this threshold should be equal to 5. R issues a warning when the condition is not satisfied, which can increase computation time.

Of course, the user can relax or tighten these two conditions, but it is normally better not to. When the conditions are not fulfilled, the graphic output does not display the cells where the indices cannot be computed.
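The two indices and the two thresholds can be sketched for a single cell as follows. This is an illustration under assumptions, not the function used in this document: the helper cell_test() and its arguments x, n and p0 are hypothetical, and it returns the two-sided p-value of a chi-square goodness-of-fit test, whereas the outlier index described above is one-sided.

```r
# Hypothetical helper: salience index and chi-square p-value for one cell.
#   x  = number of news in the cell where the topic is present (Xobs)
#   n  = total number of news in the cell
#   p0 = reference share of the topic in the whole table
cell_test <- function(x, n, p0, minsamp = 20, mintest = 5) {
  est <- n * p0  # expected number of news with the topic (Xest)
  # Thresholds not met: the indices are not computed
  if (n < minsamp || est < mintest) {
    return(c(salience = NA_real_, p.value = NA_real_))
  }
  # Goodness-of-fit test of (topic, no topic) against the reference shares
  test <- chisq.test(c(x, n - x), p = c(p0, 1 - p0))
  c(salience = x / est, p.value = unname(test$p.value))
}

res_large <- cell_test(x = 30, n = 200, p0 = 0.05)  # both indices are computed
res_small <- cell_test(x = 2, n = 10, p0 = 0.05)    # fewer than minsamp news: NA
```

In the first call the expected count is 200 × 0.05 = 10, so the salience index is 30 / 10 = 3; a salience above 1 indicates over-representation of the topic in the cell.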

The function that performs the test is the following:

2.1.2 Interactive graphic

Once the statistical table has been computed, the user can choose between two visualizations, based on the salience index (exploration) or on the chi-square test (outlier detection). In both cases the result is an interactive plotly figure in which each cell can be clicked to inspect its statistical parameters.

Users interested in static graphics (e.g. for publication) can easily adapt the program and write new functions, for example with ggplot2.

To illustrate each type of graphic, we use the example of the topic of mobility, without distinction between migrants and refugees.

2.2 Topic frequency (what)

The first function has only one dimension and evaluates the proportion of news related to the topic. As a consequence, it is not associated with a statistical test and returns only a table and a graphic showing the proportion of news where the topic is present or absent.

2.2.1 Function

2.2.2 Example

   what  news pct
1:   NA 20639 100
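A table of this kind can be sketched in base R by summing the news weights of the hypercube over the what dimension. The toy hypercube hc and the label "_no_" for news without the topic are assumptions made for the illustration.

```r
# Toy hypercube reduced to the two columns needed here (assumed values)
hc <- data.frame(
  what = c("pand", "_no_", "_no_", "pand"),
  news = c(1, 10, 2.5, 0.5)
)

# Sum the news weights by topic value, then convert to percentages
tab <- aggregate(news ~ what, data = hc, FUN = sum)
tab$pct <- 100 * tab$news / sum(tab$news)
tab
```

Note that aggregate() with a formula drops rows where what is NA, so a corpus that has not been tagged yet needs an explicit label for missing topics.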

2.3 Topic variation by media (who.what)

The function who.what explores the variation of interest for the topic across the different media of the corpus.
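The underlying cross-tabulation can be sketched as follows, assuming a hypercube with the columns who, what and news. The media names and the counts are invented for the illustration, and the statistical test is left out.

```r
# Toy hypercube with two hypothetical media
hc <- data.frame(
  who  = c("media_A", "media_A", "media_B", "media_B"),
  what = c("topic", "_no_", "topic", "_no_"),
  news = c(40, 960, 20, 980)
)

# Cross media and topic, summing the news weights
tab <- xtabs(news ~ who + what, data = hc)

# Share of news about the topic within each media (row percentages)
pct <- 100 * prop.table(tab, margin = 1)[, "topic"]
pct
```

Here the topic accounts for 4% of the news of media_A and 2% of the news of media_B; the salience index compares such shares with the share observed in the whole corpus.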

2.3.1 Function

2.3.2 Example

We present here the statistical table and the two types of graphics that can be produced. In the following case we will only present the outlier graphic.

The analysis reveals a clear over-representation of the topic in the French newspaper Le Figaro (4.37% of news) as compared to the other media (2.1 to 2.5%).

2.4 Topic variation through time (when.what)

In this case we want to analyze whether the topic has been more or less present in some periods than in others. It can therefore be useful to first change the level of aggregation and transform the initial hypercube (by day) to another level (e.g. by month). It is also possible to change the length of the time period, as the outliers are defined by reference to the whole period of analysis.
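The change of time level can be sketched in base R by replacing each day with the first day of its month and summing the news weights again. The toy hypercube and its values are assumptions made for the illustration.

```r
# Toy daily hypercube (assumed values)
hc <- data.frame(
  when = as.Date(c("2019-01-01", "2019-01-02", "2019-01-31", "2019-02-05")),
  what = c("topic", "_no_", "topic", "_no_"),
  news = c(1, 10, 2, 5)
)

# Replace each day by the first day of its month
hc$when <- as.Date(cut(hc$when, breaks = "month"))

# Aggregate the news weights at the new time level
hc_month <- aggregate(news ~ when + what, data = hc, FUN = sum)
hc_month
```

Using breaks = "week" or breaks = "year" in cut() gives other levels of aggregation in the same way.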

2.4.1 Function

2.4.2 Example 1 : 2014-2015 by month

The analysis reveals clear discontinuities in the timeline of the topic. We start with a low level (0.5 to 1.2%) from January 2014 to March 2015, followed by a sharp jump in April-June 2015 (3 to 5%) and a major peak in September 2015 (15.8% of news). At the end of the period, the level is clearly higher than at the beginning.

2.5 Topic variation through space (where.what)

This function analyzes the spatial distribution of the places associated with the topic. As we have only collected states, we do not take into account the news where the topic of interest is associated with geographical areas other than states (e.g. “migrants from subsaharan Africa”). But these are only a minority of cases, and collecting states makes it easy to produce a geographical map of the phenomenon.

2.5.1 Function

2.5.2 Example

When we draw the map, we eliminate the news related to the topic in which no country is mentioned. As a consequence the reference value changes: in the whole sample 2.73% of the news were related to the topic, but in the sample of news where a country is mentioned the share is 2.83%.

As the total number of news can be small for some countries, we have lowered the thresholds of the statistical test here in order to display more countries on the map. The results should therefore be interpreted with caution.

The analysis reveals that some countries are “specialized” in the topic during the period of observation. For example, 53.5% of the news about Hungary were associated with the question of migrants and refugees, which is obviously related to the media coverage of the wall established by Viktor Orban in 2015. Other countries are on the contrary characterized by an under-representation of the topic, like the USA, where the topic is associated with only 0.7% of the news. But the situation would change after the election of Donald Trump, whose own wall project dramatically increased the number of news about the USA and migrants.

2.6 Crossing three dimensions?


Appendices

Session info

| setting | value |
|---|---|
| version | R version 4.1.0 (2021-05-18) |
| os | Windows 10 x64 |
| system | x86_64, mingw32 |
| ui | RTerm |
| language | (EN) |
| collate | French_France.1252 |
| ctype | French_France.1252 |
| tz | Europe/Paris |
| date | 2021-12-17 |

| package | ondiskversion | source |
|---|---|---|
| cowplot | 1.1.1 | CRAN (R 4.1.1) |
| data.table | 1.14.0 | CRAN (R 4.1.0) |
| dplyr | 1.0.6 | CRAN (R 4.1.0) |
| DT | 0.18 | CRAN (R 4.1.0) |
| FactoMineR | 2.4 | CRAN (R 4.1.0) |
| ggplot2 | 3.3.3 | CRAN (R 4.1.0) |
| knitr | 1.34 | CRAN (R 4.1.1) |
| leaflet | 2.0.4.1 | CRAN (R 4.1.1) |
| mapsf | 0.2.0 | CRAN (R 4.1.0) |
| mapview | 2.10.0 | CRAN (R 4.1.1) |
| plotly | 4.9.4.1 | CRAN (R 4.1.0) |
| quanteda | 3.0.0 | CRAN (R 4.1.0) |
| RColorBrewer | 1.1.2 | CRAN (R 4.1.0) |
| rmarkdown | 2.11 | CRAN (R 4.1.1) |
| rnaturalearth | 0.1.0 | CRAN (R 4.1.2) |
| rnaturalearthdata | 0.1.0 | CRAN (R 4.1.2) |
| rzine | 0.1.0 | gitlab () |
| sf | 1.0.0 | CRAN (R 4.1.0) |
| stargazer | 5.2.2 | CRAN (R 4.1.0) |
| tidyr | 1.1.3 | CRAN (R 4.1.0) |
| tidytext | 0.3.1 | CRAN (R 4.1.1) |
| wbstats | 1.0.4 | CRAN (R 4.1.2) |
